Abstract. To analyze the information theoretical derivation of Tsirelson's bound
based on information causality, which is defined in terms of the efficiency of random 
access coding, we introduce a generalized mutual information between a classical 
system and a general probabilistic system. The generalization is based on the 
consideration of the classical information capacity of a general probabilistic system. 
We show that the chain rule of the generalized mutual information is essential for the 
derivation of Tsirelson's bound. By using the mutual information, we formulate a
principle, which we call the "no-supersignalling condition", that the assistance of nonlocal
correlations does not increase the capacity of classical communication. We prove that
this condition is equivalent to the no-signalling condition, and as a result, we show 
that information causality is essentially a matter of a single party. 

1. Introduction

One of the most counterintuitive phenomena that quantum mechanics predicts is
nonlocality. As shown by the violation of the Bell inequalities [1], the statistics of
the outcomes of measurements performed at two space-like separated points on an entangled
state can exhibit strong correlations that cannot be described within the framework
of local realism. It is also known that, despite this fact, quantum correlations still
satisfy the no-signalling condition, i.e., the condition that they cannot be used for
superluminal communication, which is prohibited by special relativity. The
quantum violation of the Clauser-Horne-Shimony-Holt (CHSH) inequality [2] is limited 
above by Tsirelson's bound [9]. In a seminal paper [3], Popescu and Rohrlich showed that 
Tsirelson's bound is strictly lower than the limit imposed by the no-signalling condition 
alone. This result raised the question of why the strength of nonlocality is limited in the
quantum world. If we could find an operational principle, rather than a mathematical one,
to answer this question, it would help us better understand why quantum mechanics is
the way it is [6, 7, 8].

From the information theoretical point of view, it is natural to ask if the superstrong 
nonlocality, i.e., the nonlocal correlations exceeding Tsirelson's bound, could enhance 
our ability to send classical information to a distant receiver [4]. Suppose that Alice is 
trying to send classical information to distant Bob by using the assistance of preshared 
nonlocal correlations. By the no-signalling condition, if no classical communication from
Alice to Bob is performed, Bob's information gain is zero bits. In other words, zero bits of
classical communication can produce no more than zero bits of classical information gain
for the receiver. On the other hand, it might be possible that m (> 0) bits of classical
communication produce more than m bits of classical information gain for the receiver.
The possibility of such an implausible situation would be related to the strength of the
nonlocal correlations. In particular, one may expect that Tsirelson's bound can be derived from
the impossibility of such a situation.

Motivated by such a consideration, information causality was proposed as an answer
to the question [4]. Information causality is the condition that "in the bipartite
nonlocality-assisted random access coding protocols, the total amount of the receiver's
information gain cannot be greater than the amount of classical communication allowed
in the protocol". This condition is never violated in the classical and the quantum
theory, whereas it is violated in the theory in which the nonlocal correlation exceeding 
Tsirelson's bound is allowed. It implies that Tsirelson's bound is derived from this 
purely information theoretical principle. Thus information causality is regarded as one 
of the basic informational principles at the foundations of quantum mechanics. 

In [4], it is proved that information causality is never violated in any no-signalling
theory insofar as we can define a "mutual information" satisfying five natural properties
in the theory. Conversely, the violation of Tsirelson's bound, which implies the violation
of information causality, indicates that at least one of the five properties of the mutual
information is lost. It is then natural to ask which of the five
properties is lost when Tsirelson's bound is violated.

In order to answer this question, we need to define "mutual information" in the 
form that is applicable to general probabilistic theories. Several investigations have been 
made along this line. In [18, 19], a generalized entropy H is defined in the form that is 
applicable to general probabilistic theories, and then a mutual information is defined in 
terms of the generalized entropy by $I(A : B) := H(A) + H(B) - H(A, B)$. By using this
mutual information, it is proved that the data processing inequality is not satisfied in the 
theory in which Tsirelson's bound is violated. Similar results are also obtained in [20, 21]. 
However, the way the entropies are defined in these approaches is mathematical, and
it is not clear whether such a definition fits into a natural extension of the concept of 
information in general settings. Note that in classical and quantum information theory, 
the operational meaning of the entropy and the mutual information is given by the 
source coding theorems and the channel coding theorems. In [19], a coding theorem 
analogous to the Schumacher compression theorem [12] in the quantum information 
theory is investigated using their generalized entropy. However, their consideration is 
only applicable under several restrictions. As discussed in [18], we need to seek
definitions of the generalized entropies and mutual informations based on the analysis 
of the data compression or the channel capacities. Such an approach is also attempted 
in [11]. 

In this paper, motivated by the discussions in [18] and by the attempt in [11], 
we introduce an operational definition of the mutual information that is applicable
to any general probabilistic theory. This is a generalization of the quantum mutual
information between a classical system and a quantum system. Unlike the previous
entropic approaches, we directly address the mutual information. The generalization
is based on the channel coding theorem. Thus the generalized mutual information 
inherently has an operational meaning as the transmission rate of classical information, 
or the classical information capacity of the physical system. Our definition does 
not require any mathematical notion such as the state space or the fine-grained 
measurement. The generalized mutual information is defined between a classical system 
and a general probabilistic system, and is not applicable to two general probabilistic 
systems, but it is sufficient for analyzing the situation describing information causality. 
The generalized mutual information satisfies four of the five properties of the mutual
information, namely all except the chain rule. This automatically implies that the violation of
Tsirelson's bound indicates the violation of the chain rule.

By using the generalized mutual information, we further investigate the derivation 
of Tsirelson's bound in terms of information causality. We formulate a principle, which
we call the "no-supersignalling condition", that the assistance of nonlocal correlations does
not increase the capacity of classical communication. We prove that this condition is
equivalent to the no-signalling condition. This result is similar to the result obtained
in [19], but is now operationally well grounded. Applying this result to the
analysis of information causality, we argue that information causality does not refer
to the difference between the amount of classical communication and the receiver's
information gain, as it was thought to. Instead, information causality imposes an
upper bound on the efficiency of random access coding, and is essentially a matter of
one party. It turns out that the efficiency of random access coding is closely related to
the chain rule. As an example of this fact, we show that the chain rule restricts the state space
of one gbit.

This paper is organized as follows. In Section 2, we give a brief review of information 
causality. In Section 3, we give the definition of the generalized mutual information, and 
show that Tsirelson's bound is derived from the chain rule. In Section 4, we prove that 
the generalized mutual information is indeed a generalization of the quantum mutual 
information. In Section 5, we formulate the no-supersignalling condition and
prove that it is equivalent to the no-signalling condition. In Section
6, we discuss the relation between the chain rule and random access coding, and show 
that information causality is a matter of one party. In Section 7, we show that we 
can restrict the state space of one gbit by assuming the chain rule. We conclude with 
discussions in Section 8. 



2. Review of information causality 

Information causality is the principle that "the total amount of classical information gain
that the receiver can obtain in the bipartite nonlocality-assisted random access coding
protocol cannot be greater than the amount of classical communication that is allowed in
the protocol". Suppose that a string of n random and independent bits $\vec{X} = (X_1, \dots, X_n)$
is given to Alice, and a random number $k \in \{1, \dots, n\}$ is given to distant Bob. The
task is for Bob to correctly guess $X_k$ under the condition that they can use a preshared
resource of correlations and an m-bit one-way classical communication from Alice to
Bob (see Figure 1). To accomplish this task, Alice first performs a measurement on her
part of the resource (denoted by A in the figure), depending on $\vec{X}$. She then constructs
an m-bit message M from $\vec{X}$ and the measurement outcome, and sends it to Bob. Bob,
after receiving M, performs a measurement on his part of the resource (denoted by B in
the figure), depending on M and k. From the outcome of the measurement he computes
his guess $G_k$ for $X_k$. The efficiency of their protocol to accomplish this task is quantified
by

$$J := \sum_{k=1}^{n} I_C(X_k : G_k) , \tag{1}$$

where $I_C(X_k : G_k)$ is the classical (Shannon) mutual information between $X_k$ and $G_k$.
Information causality is the condition that, whatever strategy they take and whatever
preshared correlation allowed in the theory they use,

$$J \le m \tag{2}$$

must hold for all $m \ge 0$. The derivation of Tsirelson's bound in terms of information
causality consists of the following two theorems that are proved in [4]. 

Figure 1. Nonlocality-assisted random access coding. The task is for Bob to correctly
guess $X_k$, where $k$ is a random number unknown to Alice.



Theorem 2.1 If we can define a function $I(A : B)$ satisfying the following properties
in the general probabilistic theory, $J \le m$ holds for all $m \ge 0$. The properties are

• Symmetry: $I(A : B) = I(B : A)$ for any systems A and B.

• Nonnegativity: $I(A : B) \ge 0$ for any systems A and B.

• Consistency: If both of the systems A and B are in classical states, $I(A : B)$
coincides with the classical mutual information.

• Data Processing Inequality: Under any local transformation that maps the states
of system B into the states of another system B' without post-selection, $I(A : B) \ge
I(A : B')$.

• Chain Rule: For any systems A, B and C, the conditional mutual information defined
by $I(A : B|C) := I(A : B, C) - I(A : C)$ is symmetric in A and B.

Theorem 2.2 If there exists a nonlocal correlation exceeding Tsirelson's bound, we can
construct a nonlocality-assisted communication protocol by which $J > m$ is achieved.

Theorem 2.1 guarantees that both classical theory and quantum theory
satisfy information causality. Theorem 2.2 implies that all "supernonlocal" theories,
i.e., the general probabilistic theories in which the existence of nonlocal correlations
stronger than Tsirelson's bound is allowed, violate information causality. Therefore,
in any supernonlocal theory, at least one of the five properties of the mutual information
is lost for any definition of the mutual information. Conversely, if we could find a
definition of the mutual information that is applicable for any general probabilistic 
theory, and if it always satisfies four of the five properties, the remaining one property 
could be regarded as one of the basic informational relations at the foundation of 
quantum mechanics. 
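
As a rough numerical illustration of Theorems 2.1 and 2.2, the following sketch evaluates J in (1) for the simplest case n = 2, m = 1. It rests on assumptions that are not spelled out in the text above: we take the simplest protocol of [4], in which Bob's guess is correct exactly when the shared correlation satisfies its defining relation, so that each bit is guessed with the same success probability `p_success` and with symmetric errors. Under these assumptions $I_C(X_k : G_k) = 1 - h(p_{\rm success})$, where h is the binary entropy. The function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary Shannon entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rac_efficiency(n_bits: int, p_success: float) -> float:
    """J = sum_k I_C(X_k : G_k), assuming uniform bits and symmetric errors."""
    return n_bits * (1.0 - binary_entropy(p_success))


m = 1  # bits of classical communication allowed in the protocol

# Perfect superstrong correlations: every guess is correct.
j_pr_box = rac_efficiency(2, 1.0)

# Success probability (1 + 1/sqrt(2)) / 2, the value attained at Tsirelson's bound.
j_tsirelson = rac_efficiency(2, (1.0 + 1.0 / math.sqrt(2.0)) / 2.0)

print(j_pr_box > m, j_tsirelson <= m)
```

With n = 2 this simple protocol exceeds m only when the success probability is above roughly 0.89, so it does not by itself witness a violation for correlations just above Tsirelson's bound; the proof of Theorem 2.2 in [4] uses larger n for that purpose.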



3. Generalized mutual information 



Suppose that there are a classical system X and a system S that is described by a
general probabilistic theory. The states of X are labeled by a finite alphabet $\mathcal{X}$.
For each state x of X, the corresponding state of S, denoted by $\phi_x$, is determined.

Figure 2. The channel that we consider to define the mutual information between
the system X and the system S. It has a classical system as the input system and a 
general probabilistic system as the output system. 



The state of the composite system XS is determined by the probability distribution
$p(x) = \Pr(X = x)$, which represents the probability that the system X is in the state
x, and the corresponding state $\phi_x$ of S. Thus the state of the composite system XS is
identified with an ensemble $\{p(x), \phi_x\}_{x \in \mathcal{X}}$. To define a generalized mutual information
$I_G(X : S)$ between the system X and the system S in the state $\{p(x), \phi_x\}_{x \in \mathcal{X}}$, we
analyze the classical information capacity of a channel that outputs the system S in the
state $\phi_x$ according to the input X = x (Figure 2). As usually considered in information
theory, the sender Alice, who has access to X, tries to send classical information to the
receiver Bob, who has access to S, by using the channel many times. Suppose that they
use $l$ identical and independent copies of this channel. Let $X_1, \dots, X_l$ be the inputs of
the $l$ channels and $S_1, \dots, S_l$ be the corresponding output systems.

Alice's encoding scheme is determined by a codebook. Let $w \in \{1, \dots, N\}$ be a
message that Alice tries to communicate, and the codeword $x^l(w) = x_1(w) \cdots x_l(w)$ be
the corresponding input sequence to the channels. The codebook $\mathcal{C}$ is defined as the list
of the codewords for all messages by

$$\mathcal{C} := \begin{bmatrix} x_1(1) & \cdots & x_l(1) \\ \vdots & \ddots & \vdots \\ x_1(N) & \cdots & x_l(N) \end{bmatrix} . \tag{3}$$

The letter frequency $f(x)$ for the codebook is defined by

$$f(x) := \frac{|\{(k, w) \mid x_k(w) = x,\ 1 \le k \le l,\ 1 \le w \le N\}|}{lN} \quad (x \in \mathcal{X}) . \tag{4}$$

For the given probability distribution $\{p(x)\}_{x \in \mathcal{X}}$, the tolerance $\tau$ of the code is defined
by

$$\tau := \max_{x \in \mathcal{X}} |p(x) - f(x)| . \tag{5}$$

By performing a decoding measurement on the output systems $S_1, \dots, S_l$, Bob
tries to guess what the original message w is. Let $\mathcal{D}$ denote the decoding measurement.
Note that, in general, the decoding measurement is not one in which Bob performs
a measurement on each of $S_1, \dots, S_l$ individually, but one in which the whole of the
composite system $S_1 \cdots S_l$ is subjected to a measurement. Let $W$, $\hat{W}$ be Alice's original
message and Bob's decoding outcome, respectively. The average error probability $P_e$ is
defined by

$$P_e := \frac{1}{N} \sum_{w=1}^{N} \Pr(\hat{W} \ne w \mid W = w) . \tag{6}$$

The pair of the codebook $\mathcal{C}$ and the decoding measurement $\mathcal{D}$ is called an $(N, l)$ code.
The ratio $\log N / l$ is called the rate of the code, and it represents how many bits of
classical information are transmitted per use of the channel.
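
To make the quantities attached to a codebook concrete, the following sketch computes the letter frequency, the tolerance and the rate for a small hand-picked binary codebook. It assumes that the letter frequency f(x) is the fraction of all lN codeletters in the codebook that are equal to x, and that logarithms are taken to base 2; the codebook itself and the function names are hypothetical.

```python
import math
from collections import Counter


def letter_frequency(codebook, alphabet):
    """Fraction of all l*N codeletters in the codebook equal to each letter."""
    counts = Counter(letter for codeword in codebook for letter in codeword)
    total = sum(len(codeword) for codeword in codebook)
    return {x: counts[x] / total for x in alphabet}


def tolerance(p, f):
    """Largest deviation between the target distribution p and the frequency f."""
    return max(abs(p[x] - f[x]) for x in p)


def rate(codebook):
    """log2(N) / l for N codewords of common length l."""
    return math.log2(len(codebook)) / len(codebook[0])


codebook = ["0011", "0101", "1100", "1010"]  # N = 4 codewords of length l = 4
p = {"0": 0.5, "1": 0.5}                     # target input distribution p(x)

f = letter_frequency(codebook, p.keys())
print(f, tolerance(p, f), rate(codebook))
```

In this toy codebook each letter appears in exactly half of the sixteen positions, so the tolerance with respect to the uniform distribution vanishes, while the rate is half a bit per channel use.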

Definition 3.1 A rate R is said to be achievable with p(x) if there exists a sequence of
$(2^{lR}, l)$ codes such that

(i) $P_e^{(l)} \to 0$ when $l \to \infty$,

(ii) $\tau^{(l)} \to 0$ when $l \to \infty$,

(iii) All codeletters $x_k(w)$ in $\mathcal{C}^{(l)}$ are elements of $\mathcal{X}_p := \mathrm{supp}\, p = \{x \mid x \in \mathcal{X}, p(x) \neq 0\}$.

Definition 3.2 The mutual information between a classical system X and a general
probabilistic system S, denoted by $I_G(X : S)$, is the function that satisfies the condition
that

(i) A rate R is achievable with p(x) if $R < I_G(X : S)$,

(ii) A rate R is achievable with p(x) only if $R \le I_G(X : S)$.

We also define $I_G(S : X)$ by $I_G(S : X) := I_G(X : S)$.

Theorem 3.3 $I_G(X : S)$ exists and satisfies $I_G(X : S) \le H(X)$. Here, $H(X)$ is the
Shannon entropy of the system X defined by $H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.

Proof. First we prove the existence of $R^* := \sup\{R \mid R \text{ is achievable with } p(x)\}$.
Consider a $(2^{lR}, l)$ code and suppose that Alice's message $W = 1, \dots, 2^{lR}$ is uniformly
distributed. Let $I'$, $H'$ be the mutual information and the entropy when the input
sequence is the codewords corresponding to the uniformly distributed message $W$. By
Fano's inequality, we have

$$H'(W|\hat{W}) \le P_e^{(l)} lR + 1 , \tag{7}$$

where $P_e^{(l)} = P(\hat{W} \ne W)$. Thus

$$\begin{aligned} lR = H'(W) &= I'(W : \hat{W}) + H'(W|\hat{W}) \\ &\le I'(X^l : \hat{W}) + P_e^{(l)} lR + 1 \\ &\le H'(X^l) + P_e^{(l)} lR + 1 . \end{aligned} \tag{8}$$

Here, we use the data processing inequality in the first inequality. By introducing a
classical variable K that indicates k with the probability distribution $P(K = k) = 1/l$,
we also have

$$H'(X^l) \le \sum_{k=1}^{l} H'(X_k) = l H'(X|K) \le l H'(X) . \tag{9}$$

From (8) and (9), we obtain

$$P_e^{(l)} \ge 1 - \frac{H'(X)}{R} - \frac{1}{lR} . \tag{10}$$

If R is achievable with p(x), there exists a sequence of $(2^{lR}, l)$ codes satisfying $P_e^{(l)} \to 0$
and $H'(X) \to H(X)$ when $l \to \infty$. Thus $R \le H(X)$. Hence $R^*$ exists and satisfies
$R^* \le H(X)$.

Next we prove that any rate $R < R^*$ is also achievable with p(x). Let $\{(\mathcal{C}^{*(l)}, \mathcal{D}^{*(l)})\}_l$
be the sequence of $(2^{lR^*}, l)$ codes with all codeletters in $\mathcal{X}_p = \mathrm{supp}\, p$ that satisfies $P_e^{*(l)} \to 0$ and
$\tau^{*(l)} \to 0$. For arbitrary $0 < \lambda < 1$, define another codebook $\mathcal{C}^{(l)}$ by using $\mathcal{C}^{*(\lambda l)}$
for the first $\lambda l$ codeletters and by choosing the last $(1 - \lambda) l$ codeletters arbitrarily
so that the total tolerance is sufficiently small. Also define the corresponding decoding
measurement $\mathcal{D}^{(l)}$ as the measurement in which the output system $S_1 \cdots S_{\lambda l}$ is subjected
to the decoding measurement $\mathcal{D}^{*(\lambda l)}$ and the output system $S_{\lambda l + 1} \cdots S_l$ is ignored. The
code sequence $\{(\mathcal{C}^{(l)}, \mathcal{D}^{(l)})\}_l$ constructed in this way is a sequence of $(2^{l \lambda R^*}, l)$ codes with
all codeletters in $\mathcal{X}_p$ that satisfies $P_e^{(l)} \to 0$ and $\tau^{(l)} \to 0$. Thus $R = \lambda R^*$ is achievable
with p(x). Hence we obtain $R^* = I_G(X : S)$. □
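
The bound of Theorem 3.3 can be checked directly in the special case where S is a classical system, for which the generalized mutual information reduces to the Shannon mutual information by the noisy channel coding theorem. The sketch below does this for a binary symmetric channel with flip probability 0.1 and uniform input; the channel, the input distribution and the function names are illustrative assumptions, not part of the construction above.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in dist if q > 0.0)


def mutual_information(p_x, channel):
    """I_C(X : S) for input distribution p_x and transition matrix channel[x][s]."""
    n_in, n_out = len(p_x), len(channel[0])
    p_s = [sum(p_x[x] * channel[x][s] for x in range(n_in)) for s in range(n_out)]
    h_s_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(n_in))
    return entropy(p_s) - h_s_given_x


p_x = [0.5, 0.5]                # uniform input distribution p(x)
bsc = [[0.9, 0.1], [0.1, 0.9]]  # binary symmetric channel, flip probability 0.1

i_xs = mutual_information(p_x, bsc)
print(i_xs, entropy(p_x))
```

The computed value lies between 0 and H(X) = 1 bit, as nonnegativity and Theorem 3.3 require in this classical special case.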

Note that $I_G(X : S)$ is a function of the state $\Gamma := \{p(x), \phi_x\}_{x \in \mathcal{X}}$ of the composite
system XS. To emphasize this, we sometimes use the notation $I_G(X : S)_\Gamma$. Since $R = 0$
is always achievable, $I_G(X : S)$ is nonnegative. Shannon's noisy channel coding theorem
guarantees that $I_G(X : S)$ coincides with the classical mutual information $I_C(X : S)$ if
S is a classical system. The generalized mutual information satisfies the data processing
inequality as follows.

Property 3.4 Let $\mathcal{E}_{S \to S'}$ be any local transformation that maps the states of a general
probabilistic system S into the states of another general probabilistic system S'. If $\mathcal{E}_{S \to S'}$
contains no post-selection, the generalized mutual information does not increase under
this transformation, i.e., $I_G(X : S) \ge I_G(X : S')$. Similarly, $I_G(X : S) \ge I_G(X' : S)$
under any local transformation $\mathcal{E}_{X \to X'}$ that maps the states of a classical system X into
the states of another classical system X' without post-selection.

Proof. Here we only prove the former part. For the latter part, see Appendix A.
Consider two channels, the channel I and the channel II (see Figure 3). Depending on
the input X = x, the channel I emits the system S in the state $\phi_x$, and the channel II
emits the system S' in the state $\phi'_x = \mathcal{E}_{S \to S'}(\phi_x)$. It is only necessary to verify that if
a rate R is achievable with p(x) by the channel II, R is also achievable with p(x) by
the channel I. Let $\{(\mathcal{C}'^{(l)}, \mathcal{D}'^{(l)})\}_l$ be a sequence of $(2^{lR}, l)$ codes for the channel II with
the average error probability $P_e'^{(l)}$ and the tolerance $\tau'^{(l)}$. From the code $(\mathcal{C}'^{(l)}, \mathcal{D}'^{(l)})$,
construct a $(2^{lR}, l)$ code $(\mathcal{C}^{(l)}, \mathcal{D}^{(l)})$ for the channel I by $\mathcal{C}^{(l)} = \mathcal{C}'^{(l)}$ and $\mathcal{D}^{(l)} = \mathcal{D}'^{(l)} \circ \mathcal{E}_{S \to S'}^{\otimes l}$.
Here, $\mathcal{D}'^{(l)} \circ \mathcal{E}_{S \to S'}^{\otimes l}$ represents a process in which first $\mathcal{E}_{S \to S'}$ is applied to each of $S_1, \dots, S_l$
individually and then the decoding measurement $\mathcal{D}'^{(l)}$ is performed on the total output
system $S'_1 \cdots S'_l$. The average error probability and the tolerance of this code are given
by $P_e^{(l)} = P_e'^{(l)}$ and $\tau^{(l)} = \tau'^{(l)}$, respectively. Hence, if $P_e'^{(l)} \to 0$ and $\tau'^{(l)} \to 0$, we also
have $P_e^{(l)} \to 0$ and $\tau^{(l)} \to 0$, and thus R is achievable with p(x) by the channel I. □

Figure 3. The channel II defined as the combination of the channel I and $\mathcal{E}_{S \to S'}$.

In the general probabilistic theories, a measurement on a system S without post-selection
is described by a probabilistic map $\mathcal{E}_M$ that maps the states of S into the states
of a classical system $T_S$. $T_S$ represents the register of the measurement outcomes. As
a special case of Property 3.4, we have $I_G(X : T_S) \le I_G(X : S)$ under $\mathcal{E}_M$, which is a
generalization of Holevo's inequality. Let us define the accessible information $I_{acc}(X : S)$
by

$$I_{acc}(X : S) := \max I_C(X : T_S) , \tag{11}$$

where the maximization is taken over all possible measurements on S. Then we have
$0 \le I_{acc}(X : S) \le I_G(X : S)$.

Property 3.5 $I_{acc}(X : S) = 0$ if and only if $I_G(X : S) = 0$.

Proof. The "if" part is a direct consequence of Property 3.4. To prove the "only
if" part, note that $I_{acc}(X : S) = 0$ implies that $\phi_x$ is the same state for all $x \in \mathcal{X}_p = \mathrm{supp}\, p$.
Thus all states $\phi_{\vec{x}} := \phi_{x_1} \phi_{x_2} \cdots \phi_{x_l}$ ($\vec{x} \in \mathcal{X}_p^l$) are the same state of the composite
system $S_1 S_2 \cdots S_l$. It implies that no code with all codeletters in $\mathcal{X}_p$ can be used for
transmitting information. □

To summarize, our generalized mutual information satisfies the following properties. 

• Symmetry: $I_G(X : S) = I_G(S : X)$.

• Nonnegativity: $I_G(X : S) \ge 0$.

• Consistency: When S is a classical system, $I_G(X : S) = I_C(X : S)$.

• Data Processing Inequality: $I_G(X : S) \ge I_G(X' : S')$ under local stochastic maps
$\mathcal{E}_{X \to X'}$ and $\mathcal{E}_{S \to S'}$ that contain no post-selection.

Thus, from Theorem 2.1 and Theorem 2.2, we conclude that the chain rule of 
the generalized mutual information should be violated in any supernonlocal theory. 
Conversely, the chain rule implies Tsirelson's bound. 

4. Quantum mutual information 

The generalized mutual information defined by Definition 3.2 looks different from the
quantum mutual information $I_Q(X : S)$ defined by

$$I_Q(X : S)_\rho = H(S)_\rho - \sum_{x} p(x) H(S)_{\rho_x} , \tag{12}$$

where

$$\rho = \sum_{x} p(x)\, |x\rangle\langle x|^X \otimes \rho_x^S , \tag{13}$$

$$\rho^S = \sum_{x} p(x)\, \rho_x . \tag{14}$$

However, with a slight generalization of the Holevo-Schumacher-Westmoreland theorem,
it is shown that these definitions are equivalent in quantum mechanics. 

Theorem 4.1 In quantum mechanics, the generalized mutual information coincides
with the quantum mutual information, i.e.,

$$I_G(X : S)_{\Gamma_\rho} = I_Q(X : S)_\rho , \tag{15}$$

where

$$\rho = \sum_{x} p(x)\, |x\rangle\langle x|^X \otimes \rho_x^S \tag{16}$$

and $\Gamma_\rho = \{p(x), \rho_x\}_{x \in \mathcal{X}}$.

Proof. To prove this, it is only necessary to verify the following two statements:

(i) A rate R is achievable with p(x) if $R < I_Q(X : S)_\rho$,

(ii) A rate R is achievable with p(x) only if $R \le I_Q(X : S)_\rho$.

The first statement is proved in [13, 14] by using random code generation, and the
second statement is proved in the following way. Consider a $(2^{lR}, l)$ code and suppose
that Alice's message $W = 1, \dots, 2^{lR}$ is uniformly distributed. Similarly to (8), we have

$$lR = H'(W) = I'(W : \hat{W}) + H'(W|\hat{W}) \le I'_Q(X^l : S^l) + P_e^{(l)} lR + 1 . \tag{17}$$

Here, we use the data processing inequality. We also have

$$\begin{aligned} I'_Q(X^l : S^l) &= H'(S^l) - H'(S^l|X^l) = H'(S^l) - \sum_{k=1}^{l} H'(S_k|X_k) \\ &\le \sum_{k=1}^{l} \left( H'(S_k) - H'(S_k|X_k) \right) = \sum_{k=1}^{l} I'_Q(X_k : S_k) \\ &= l I'_Q(X : S|K) = l I'_Q(X, K : S) - l I'_Q(K : S) \\ &\le l I'_Q(X, K : S) = l I'_Q(X : S) . \end{aligned} \tag{18}$$

In the first line, we use the fact that the state of $S_k$ depends only on $X_k$. The first
inequality is from the subadditivity of the von Neumann entropy. The last equality
holds since $K \to X \to S$ forms a Markov chain. From (17) and (18), we obtain



$$P_e^{(l)} \ge 1 - \frac{I'_Q(X : S)}{R} - \frac{1}{lR} . \tag{19}$$

If R is achievable with p(x), there exists a sequence of $(2^{lR}, l)$ codes satisfying $P_e^{(l)} \to 0$
and $I'_Q(X : S) \to I_Q(X : S)_\rho$ when $l \to \infty$. Thus $R \le I_Q(X : S)_\rho$. □
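
As a small quantum illustration of (12) and of the inequality $I_C(X : T_S) \le I_G(X : S)$, the sketch below considers two equiprobable pure qubit states with overlap $c = |\langle\psi_0|\psi_1\rangle|$. It relies on two standard facts that are assumed here rather than derived in this paper: the average state has eigenvalues $(1 \pm c)/2$, so that (12) reduces to $h((1 + c)/2)$, and the minimum-error (Helstrom) measurement distinguishes the states with success probability $(1 + \sqrt{1 - c^2})/2$ and symmetric errors. The function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary Shannon entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def holevo_two_pure_states(overlap: float) -> float:
    """I_Q(X : S) of (12) for two equiprobable pure states with the given overlap."""
    return binary_entropy((1.0 + overlap) / 2.0)


def helstrom_information(overlap: float) -> float:
    """Shannon information I_C(X : T_S) obtained with the minimum-error measurement."""
    p_success = (1.0 + math.sqrt(1.0 - overlap ** 2)) / 2.0
    return 1.0 - binary_entropy(p_success)


overlap = 1.0 / math.sqrt(2.0)  # for example the states |0> and |+>
chi = holevo_two_pure_states(overlap)
info = helstrom_information(overlap)
print(info, chi, info <= chi)
```

The information extracted by this particular measurement is a lower bound on the accessible information, and it stays below the quantum mutual information, in line with the generalized Holevo inequality of Section 3.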

Figure 4. The situation that the no-supersignalling condition refers to. The amount
of information about $\vec{X}$ contained in M and B is quantified by $I_G(\vec{X} : M, B)$.

5. No-supersignalling condition 

In this section, we formulate a principle that we call the "no-supersignalling condition" to
further investigate the argument of information causality. Suppose that Alice is trying to
send to distant Bob information about n independent classical bits $\vec{X} = (X_1, \dots, X_n)$, under the
condition that they can only use an m-bit classical communication M from Alice to Bob
and a supplementary resource of correlations preshared between them (see Figure 4).
The situation is similar to the setting of information causality described in Section 2, but
now we do not introduce random access coding. Instead, we evaluate Bob's information
gain by $I_G(\vec{X} : M, B)$. We say that the no-supersignalling condition is satisfied if
$I_G(\vec{X} : M, B) \le m$ holds for all $m \ge 0$. The condition indicates that "the assistance
of correlations cannot increase the capacity of classical communication", which was the
original motivation for introducing information causality. In what follows, we prove that
the no-supersignalling condition is equivalent to the no-signalling condition.

Lemma 5.1 For any classical systems X, Y and any general probabilistic system S, if
$I_{acc}(X : S) = 0$ then $I_{acc}(X : S, Y) \le H(Y)$.

Figure 5. The channel that we consider to prove Lemma 5.1. For each pair of the
input X = x and the output Y = y, the corresponding state $\phi_{xy}$ of the output system
S is determined.

Proof. Consider a channel with the input system X and the output system S, Y (see
Figure 5). Let $\mathcal{Z}$ be the set of all measurements on S, and $p(t|x, y, z)$ be the probability
of obtaining the outcome t when the measurement $z \in \mathcal{Z}$ is performed on the system
S in the state $\phi_{xy}$. To achieve $I_{acc}(X : S, Y)$, the receiver performs a measurement on
S possibly depending on Y. Let $z(y)$ be the optimal choice of the measurement when
Y = y. The probability of obtaining the outcome t when X = x and Y = y is given by

$$p_1(t|x, y) := p(t|x, y, z(y)) . \tag{20}$$

We define

$$p_1(t, x, y) := p(x, y)\, p_1(t|x, y) = p(x, y)\, p(t|x, y, z(y)) . \tag{21}$$

The condition $I_{acc}(X : S) = 0$ implies that for all $z \in \mathcal{Z}$,

$$\sum_{y} p(x, y)\, p(t|x, y, z) = p(x)\, p_2(t|z) , \tag{22}$$

where

$$p_2(t|z) := \sum_{x, y} p(x, y)\, p(t|x, y, z) . \tag{23}$$

Thus we obtain

$$\begin{aligned} p_1(t, x, y) &= p(x, y)\, p(t|x, y, z(y)) \\ &\le \sum_{y'} p(x, y')\, p(t|x, y', z(y)) \\ &= p(x)\, p_2(t|z(y)) . \end{aligned} \tag{24}$$

The accessible information $I_{acc}(X : S, Y)$ is equal to the mutual information $I_C(X : T, Y)$
calculated for the probability distribution $p_1(t, x, y)$. Therefore

$$\begin{aligned} I_{acc}(X : S, Y) &= I_C(X : T, Y)_{p_1} \\ &= \sum_{t, x, y} p_1(t, x, y) \log \frac{p_1(t, x, y)}{p(x)\, p_1(t, y)} \\ &\le \sum_{t, x, y} p_1(t, x, y) \log \frac{p(x)\, p_2(t|z(y))}{p(x)\, p_1(t, y)} \\ &= H(Y) - D(p_1(t, y) \,\|\, p_2(t, y)) \\ &\le H(Y) . \end{aligned}$$

In the first inequality, we used (24). In the next equality we defined a probability
distribution $p_2(t, y) := p_2(t|z(y))\, p(y)$. The last inequality is from the nonnegativity of
the relative entropy. □
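
A fully classical example shows that the bound of Lemma 5.1 can be saturated. In the sketch below, S is a classical bit carrying the one-time-pad ciphertext $s = x \oplus y$ of a uniform bit x under a uniform key y, so that the accessible information reduces to the Shannon mutual information. The example and the helper names are illustrative assumptions and are not taken from the proof above.

```python
import math
from itertools import product


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0.0)


def marginal(joint, keep):
    """Marginalize {outcome tuple: probability} onto the index positions in keep."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


def mutual_information(joint, a, b):
    """Shannon mutual information I(A : B) between the index groups a and b."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))


# Outcomes are tuples (x, y, s) with s = x XOR y; x and y are uniform and independent.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
X, Y, S = (0,), (1,), (2,)

i_x_s = mutual_information(joint, X, S)       # information about X in S alone
i_x_sy = mutual_information(joint, X, S + Y)  # information about X in S together with Y
h_y = entropy(marginal(joint, Y))
print(i_x_s, i_x_sy, h_y)
```

The ciphertext alone reveals nothing about X, while together with the one-bit key it reveals one full bit, which equals H(Y).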

Theorem 5.2 For any classical systems X, Y and any general probabilistic system S,
if $I_G(X : S) = 0$ then $I_G(X : S, Y) \le H(Y)$.

Proof. Consider a $(2^{lR}, l)$ code with all codeletters in $\mathcal{X}_p = \mathrm{supp}\, p$ for the channel shown in Figure
5, and suppose that Alice's message is uniformly distributed. By Fano's inequality, we
have

$$I'(W : \hat{W}) \ge lR - 1 - P_e^{(l)} lR . \tag{25}$$

By the data processing inequality, we also have

$$I'(W : \hat{W}) \le I'(X^l : Y^l, T_{S^l}) \le I'_{acc}(X^l : Y^l, S^l) . \tag{26}$$

From Property 3.5, the condition $I_G(X : S) = 0$ implies $I'_{acc}(X^l : S^l) = 0$. From Lemma
5.1, we obtain

$$I'_{acc}(X^l : Y^l, S^l) \le H'(Y^l) , \tag{27}$$

and thus

$$I'(W : \hat{W}) \le H'(Y^l) \le l H'(Y) . \tag{28}$$

Hence we obtain

$$(1 - P_e^{(l)}) R \le H'(Y) + \frac{1}{l} . \tag{29}$$

If R is achievable with p(x), there exists a sequence of $(2^{lR}, l)$ codes with all codeletters
in $\mathcal{X}_p$ that satisfies $P_e^{(l)} \to 0$ and $H'(Y) \to H(Y)$ when $l \to \infty$. Thus, for any R that is
achievable with p(x), we have $R \le H(Y)$. It implies $I_G(X : Y, S) \le H(Y)$. □

Corollary 5.3 The no-supersignalling condition is equivalent to the no-signalling condition.

Proof. By setting $X = \vec{X}$ and $Y = M$ in the result of Theorem 5.2, for all $m \ge 0$,
we obtain $I_G(\vec{X} : M, B) \le H(M) \le m$ from the no-signalling condition. Note that
$I_{acc}(\vec{X} : B) = 0$ is required by the no-signalling condition and thus $I_G(\vec{X} : B) = 0$ by
Property 3.5. Conversely, for $m = 0$, the no-supersignalling condition reads $I_G(\vec{X} : B) = 0$,
which is equivalent to the no-signalling condition. □



6. The chain rule and random access coding 

In this section, by using the result obtained in Section 5, we discuss the relation among 
information causality, random access coding and the chain rule. Let us define 

$$\Delta_{NSS} := I_G(\vec{X} : M, B) - m , \tag{30}$$
$$\Delta_{RAC} := J - I_G(\vec{X} : M, B) , \tag{31}$$
$$\Delta_{IC} := \Delta_{NSS} + \Delta_{RAC} = J - m . \tag{32}$$

$\Delta_{NSS}$ quantifies the capacity of classical communication assisted by correlations, and
$\Delta_{RAC}$ quantifies the efficiency of random access coding. The no-supersignalling condition is
equivalent to $\Delta_{NSS} \le 0$, and information causality is equivalent to $\Delta_{IC} \le 0$.

Theorem 2.2 states that, if Tsirelson's bound is violated, we have $\Delta_{IC} > 0$.
Therefore the violation of Tsirelson's bound implies at least one of $\Delta_{NSS} > 0$ or
$\Delta_{RAC} > 0$. We may then ask which of the two the violation of
Tsirelson's bound implies. As we proved in Section 5, $\Delta_{NSS} \le 0$ is
satisfied by all no-signalling theories. Thus the answer is that the violation of Tsirelson's
bound only implies $\Delta_{RAC} > 0$. Therefore, information causality does not refer to the
difference between the amount of classical communication and the receiver's information
gain. Information causality refers to how efficient random access coding can be,
and is essentially a matter of one party. In the derivation of Tsirelson's bound, the
non-existence of super-efficient RAC, represented by $\Delta_{RAC} \le 0$, is critical [16] (see Figure
6).

Figure 6. The relation among information causality, the no-supersignalling or the
no-signalling condition and random access coding. Information causality refers to the
gap in (1) represented by $\Delta_{IC}$. No-supersignalling refers to the gap in (2) represented
by $\Delta_{NSS}$, and it is irrelevant to Tsirelson's bound. The gap in (3) represented by $\Delta_{RAC}$
is crucial for the derivation of Tsirelson's bound. $\Delta_{RAC}$ is bounded above due to the
chain rule of the generalized mutual information.

It is proved in [4] that, if the assumption in Theorem 2.1 is satisfied, we have
$\Delta_{RAC} \le 0$. Since our generalized mutual information satisfies all those properties
except the chain rule, $\Delta_{RAC} > 0$ implies the violation of the chain rule. Let X, Y
be two classical systems and S be a general probabilistic system. The chain rule of the
generalized mutual information is given by

$$I_G(X, Y : S) + I_G(X : Y) = I_G(X : S) + I_G(Y : S, X) . \tag{33}$$

In particular, in the case where $I_G(X : Y) = 0$, we have

$$I_G(X, Y : S) = I_G(X : S) + I_G(Y : S, X) . \tag{34}$$

Each term in (34) has an operational meaning as the information transmission rate by 
definition. The relation is satisfied in both classical and quantum theory, but is violated 
in all supernonlocal theories. Thus we can conclude that this highly nontrivial relation 
gives a strong restriction on the physical theories. However, the operational meaning of 
this relation is not clear so far. 
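
As a sanity check of (33) in the purely classical case, the following minimal sketch treats S as a classical random variable, so that every term reduces to the Shannon mutual information. The randomly drawn joint distribution and the helper names (`marginal`, `mutual_information`, `random_joint`) are illustrative assumptions and are not part of the analysis above; the check says nothing about general probabilistic systems.

```python
import itertools
import math
import random

def marginal(joint, keep):
    """Marginalize a joint distribution {(x, y, s): prob} onto the indices in `keep`."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

def mutual_information(joint, a, b):
    """Shannon mutual information I(A:B) in bits; a and b are tuples of indices."""
    p_ab = marginal(joint, a + b)
    p_a = marginal(joint, a)
    p_b = marginal(joint, b)
    total = 0.0
    for key, prob in p_ab.items():
        if prob > 0:
            key_a, key_b = key[:len(a)], key[len(a):]
            total += prob * math.log2(prob / (p_a[key_a] * p_b[key_b]))
    return total

def random_joint(seed=0):
    """Hypothetical test distribution over three bits (x, y, s)."""
    rng = random.Random(seed)
    weights = {o: rng.random() for o in itertools.product((0, 1), repeat=3)}
    norm = sum(weights.values())
    return {o: w / norm for o, w in weights.items()}

X, Y, S = (0,), (1,), (2,)
joint = random_joint()
# Left- and right-hand sides of (33) with every system classical.
lhs = mutual_information(joint, X + Y, S) + mutual_information(joint, X, Y)
rhs = mutual_information(joint, X, S) + mutual_information(joint, Y, S + X)
print(abs(lhs - rhs) < 1e-9)
```

Both sides reduce to $H(X) + H(Y) + H(S) - H(X, Y, S)$, which is why the identity holds for any classical joint distribution.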

7. Restriction on one gbit state space 

To investigate the relationship between random access coding and the chain rule, we 
consider a gbit. A gbit is the counterpart of a qubit in general probabilistic theories 
[17]. Here, we assume no property for a gbit such as the dimension of the state space, 
the possibility or impossibility of various measurements and transformations. Instead, 
we define a gbit as the minimum unit of information in the theory, and we require that 
the classical information capacity of one gbit is not more than one bit. Thus we require 

\[ I_G(X : S_{1gb}) \le 1 \tag{35} \]

for any classical system $X$. When $X$ is a classical system composed of two independent 
and uniformly random bits $X_0$ and $X_1$, we have 

\[ I_G(X_0, X_1 : S_{1gb}) \le 1 . \tag{36} \]

By the chain rule, we have 

\[ I_G(X_0, X_1 : S_{1gb}) = I_G(X_0 : S_{1gb}) + I_G(X_1 : S_{1gb}, X_0) . \tag{37} \]

By the data processing inequality, we also have 

\[ I_G(X_0 : S_{1gb}) + I_G(X_1 : S_{1gb}, X_0) \ge I_{acc}(X_0 : S_{1gb}) + I_{acc}(X_1 : S_{1gb}) . \tag{38} \]

Thus the chain rule implies 

\[ I_{acc}(X_0 : S_{1gb}) + I_{acc}(X_1 : S_{1gb}) \le 1 . \tag{39} \]
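
Read together, (36), (37) and (38) give (39) in a single chain. The block below only restates that chain explicitly and uses no assumption beyond those already stated.

```latex
\begin{align*}
I_{acc}(X_0 : S_{1gb}) + I_{acc}(X_1 : S_{1gb})
  &\le I_G(X_0 : S_{1gb}) + I_G(X_1 : S_{1gb}, X_0) && \text{by (38)} \\
  &=   I_G(X_0, X_1 : S_{1gb})                      && \text{by (37)} \\
  &\le 1                                            && \text{by (36)} .
\end{align*}
```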

We consider the success probabilities of the decoding measurements on $S_{1gb}$ for $X_0$ and 
$X_1$. For simplicity, we assume that the optimal measurement performed on $S_{1gb}$ to 
decode $X_0$ or $X_1$ has two outcomes $t = 0, 1$. Let $P(t|m, x_0, x_1)$ be the probability of 
obtaining the outcome $t$ when $X_0 = x_0$, $X_1 = x_1$ and the measurement $m$ is performed. 
The index $m = 0, 1$ corresponds to the optimal measurement for decoding $X_0$, $X_1$, 
respectively. The list of all probabilities $\{P(t|m, x_0, x_1)\}_{t,m,x_0,x_1 = 0,1}$ can be regarded as 
representing a "state". We compare the state space of a qubit and the state space 
determined by (39). For further simplicity, we assume that for all $x_0$ and $x_1$, 

\[ P(t = x_0 | m = 0, x_0, x_1) = \frac{1 + \alpha}{2} \quad (0 \le \alpha \le 1) , \]

\[ P(t = x_1 | m = 1, x_0, x_1) = \frac{1 + \beta}{2} \quad (0 \le \beta \le 1) . \]



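Then we have 

\[ \begin{aligned}
I_{acc}(X_0 : S_{1gb}) &= I_c(x_0 : t | m = 0) = 1 - H(x_0 | t, m = 0) \\
 &= 1 - H(x_0 \oplus t | m = 0) = 1 - h\left(\frac{1 + \alpha}{2}\right) ,
\end{aligned} \tag{40} \]

and 

\[ I_{acc}(X_1 : S_{1gb}) = 1 - h\left(\frac{1 + \beta}{2}\right) . \tag{41} \]

Here, $h(x)$ is the binary entropy defined by $h(x) = -x \log x - (1 - x) \log (1 - x)$. From 
(39), (40) and (41), we have 

\[ h\left(\frac{1 + \alpha}{2}\right) + h\left(\frac{1 + \beta}{2}\right) \ge 1 . \tag{42} \]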

This inequality gives a nontrivial restriction on the state space of one gbit (see Figure 7). 
It implies that the chain rule imposes a restriction on the possibility of "superstrong" 
random access coding. It is shown in Appendix B that in the case of one qubit, the 
obtainable region is given by $\alpha^2 + \beta^2 \le 1$. 

Figure 7. Comparison of the state space of a qubit and the boundary given by the 
chain rule. The red region indicates the state space of a qubit given by $\alpha^2 + \beta^2 \le 1$. 
The blue region in addition to the red region indicates the region defined by (42). 
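
The two regions compared in Figure 7 can also be explored numerically. The following sketch first checks the closed form (40) against a direct computation of the mutual information for a uniformly random bit decoded with success probability $(1 + \alpha)/2$, and then scans a grid of $(\alpha, \beta)$ to compare the qubit region with the region allowed by (42). It assumes logarithms to base 2; the function names, the grid resolution and the numerical tolerance are illustrative choices rather than anything taken from the text.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def decoding_mutual_information(alpha):
    """I_c(x0 : t | m = 0) for a uniform bit x0 and P(t = x0) = (1 + alpha) / 2."""
    success = (1 + alpha) / 2
    # Joint distribution p(x0, t); both marginals are uniform by symmetry.
    joint = {(x, t): 0.5 * (success if x == t else 1 - success)
             for x in (0, 1) for t in (0, 1)}
    return sum(p * math.log2(p / 0.25) for p in joint.values() if p > 0)

def satisfies_chain_rule_bound(alpha, beta):
    """Inequality (42): h((1 + alpha)/2) + h((1 + beta)/2) >= 1."""
    total = binary_entropy((1 + alpha) / 2) + binary_entropy((1 + beta) / 2)
    return total >= 1 - 1e-12

def in_qubit_region(alpha, beta):
    """Qubit region derived in Appendix B."""
    return alpha ** 2 + beta ** 2 <= 1

# Check (40): direct mutual information versus 1 - h((1 + alpha)/2).
for alpha in (0.0, 0.3, 1 / math.sqrt(2), 1.0):
    direct = decoding_mutual_information(alpha)
    closed_form = 1 - binary_entropy((1 + alpha) / 2)
    print(f"alpha={alpha:.4f}  direct={direct:.6f}  closed form={closed_form:.6f}")

# Compare the two regions on a grid of (alpha, beta) in [0, 1] x [0, 1].
steps = 200
grid = [i / steps for i in range(steps + 1)]
qubit_inside_bound = all(satisfies_chain_rule_bound(a, b)
                         for a in grid for b in grid if in_qubit_region(a, b))
extra_points = [(a, b) for a in grid for b in grid
                if satisfies_chain_rule_bound(a, b) and not in_qubit_region(a, b)]
print(qubit_inside_bound, len(extra_points) > 0)
```

On this grid every qubit point satisfies (42), while points such as $(\alpha, \beta) = (0.75, 0.75)$ satisfy (42) but lie outside the qubit region, which corresponds to the blue region of Figure 7.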

8. Conclusions and discussions 

We have defined a generalized mutual information between a classical system and 
a general probabilistic system. Since the definition is based on the channel coding 
theorem, our generalized mutual information inherently has an operational meaning as 
the information transmission rate. We have shown that the mutual information coincides 
with the quantum mutual information if the system is quantum. The generalized mutual 
information satisfies nonnegativity, symmetry, the data processing inequality, and the 
consistency with the classical mutual information. However, it does not always satisfy 
the chain rule. 

By using the generalized mutual information, we have analyzed the derivation of 
Tsirelson's bound based on information causality defined in terms of the efficiency of 
random access coding. We showed that the chain rule of the mutual information, which 
is satisfied by both classical and quantum theory, is violated in any theory in which 
the existence of nonlocal correlations exceeding Tsirelson's bound is allowed. Thus we 
conclude that the chain rule implies Tsirelson's bound. 

We formulated a condition (no-supersignalling condition) that the assistance of 
preshared correlation cannot increase the capacity of the classical communication. We 
proved that this condition is equivalent to the no-signalling condition. Based on this 
result, we argued that information causality is essentially a matter of one party referring 
to the efficiency of random access coding. The efficiency of random access coding is 
restricted by the chain rule of the mutual information. As an example for this fact, we 
derived a restriction on the state space of one gbit from the chain rule. 

Although the operational meaning of the generalized mutual information is clear, 
we have not yet succeeded in finding a clear operational meaning of the chain rule. 
In classical and quantum Shannon theory, the chain rule appears in many proofs of 
coding theorems. Our result shows that it is a highly nontrivial fact that the chain rule 
is satisfied in classical and quantum theory. Therefore, investigation of the meaning of 
the chain rule would lead us to a more profound understanding of the informational 
foundations of quantum mechanics. 

On the other hand, our definition of the generalized mutual information would 
not be the only way to generalize the quantum mutual information. It would also be 
fruitful to seek other operationally motivated definitions of the generalized mutual 
information and compare them with each other. 

Appendix A. Data processing inequality 

We prove the latter part of Theorem 3.4, which states that under any local stochastic 
map $\mathcal{E}_{X \to X'}$ that contains no post-selection, we have 

\[ I_G(X : S) \ge I_G(X' : S) . \tag{A.1} \]

The effect of $\mathcal{E}_{X \to X'}$ is determined by the conditional probability distribution $p_{\mathcal{E}}(x'|x)$, 
where $x$ and $x'$ denote the states of $X$ and $X'$, respectively. Let $\{p(x), \phi_x\}_{x \in \mathcal{X}}$ be 
the state of $XS$ before applying $\mathcal{E}_{X \to X'}$. We can define probability distributions 
$p_{\mathcal{E}}(x, x') = p(x) p_{\mathcal{E}}(x'|x)$, $p(x') = \sum_x p_{\mathcal{E}}(x, x')$ and $p_{\mathcal{E}}(x|x') = p_{\mathcal{E}}(x, x')/p(x')$ for $x \in \mathcal{X}$ 
and $x' \in \tilde{\mathcal{X}}' = \{x' | x' \in \mathcal{X}', p(x') \neq 0\}$. Note that $p_{\mathcal{E}}(x|x') = 0$ for $x \notin \tilde{\mathcal{X}} = \{x | x \in 
\mathcal{X}, p(x) \neq 0\}$. The state of $X'S$ after applying $\mathcal{E}_{X \to X'}$ is $\{p(x'), \phi_{x'}\}_{x' \in \mathcal{X}'}$, where $\phi_{x'}$ is 
the mixture of $\phi_x$ with probability given by $p_{\mathcal{E}}(x|x')$. We assume that $|\mathcal{X}|, |\mathcal{X}'| < \infty$. 

To prove (A.1), consider two channels, the channel I and the channel III (see Figure 
A1). The channel I outputs the system $S$ in the state $\phi_x$ according to the input $X = x$, 
and the channel III outputs the system $S$ in the state $\phi_{x'}$ according to the input $X' = x'$. 
It is only necessary to show that if a rate $R$ is achievable with $p(x')$ by the channel III, 
$R$ is also achievable with $p(x)$ by the channel I. Consider a sequence of $(2^{lR}, l)$ codes 
$(\mathcal{C}'^{(l)}, \mathcal{D}'^{(l)})$ for the channel III that satisfies 

(i) $P'^{(l)}_e \to 0$ when $l \to \infty$, 

(ii) $\tau'^{(l)} \to 0$ when $l \to \infty$, 

(iii) All codeletters in $\mathcal{C}'^{(l)}$ are elements of $\tilde{\mathcal{X}}'$. 

Such a sequence exists if $R$ is achievable with $p(x')$ by the channel III. From the code 
$(\mathcal{C}'^{(l)}, \mathcal{D}'^{(l)})$, we randomly construct $(2^{lR}, l)$ codes $(\mathcal{C}^{(l)}, \mathcal{D}^{(l)})$ for the channel I in the 
following way. 

• For any $w$ and $k$ ($1 \le w \le 2^{lR}$, $1 \le k \le l$), generate the codeletter $x_k(w)$ 
randomly and independently according to the probability distribution $P(x_k(w) = 
x) = p_{\mathcal{E}}(x | x'_k(w))$. 

• Regardless of the randomly generated codebook $\mathcal{C}^{(l)}$, use the same decoding 
measurement $\mathcal{D}^{(l)} = \mathcal{D}'^{(l)}$. 

Figure A1. The channel III defined as the combination of $\mathcal{E}_{X \to X'}$ and the channel I. 
This channel as a whole is equivalent to a channel with the input $x'$ and the output $\phi_{x'}$. 

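Let $P^{(l)}_e(\mathcal{C}^{(l)})$ be the average error probability of the code $(\mathcal{C}^{(l)}, \mathcal{D}^{(l)})$ defined by 

\[ P^{(l)}_e(\mathcal{C}^{(l)}) := \frac{1}{2^{lR}} \sum_{u=1}^{2^{lR}} P(\hat{W} \neq u | W = u, \mathcal{C}^{(l)}) . \tag{A.2} \]

Averaging $P^{(l)}_e(\mathcal{C}^{(l)})$ over all codebooks that are randomly generated, we obtain 

\[ \bar{P}^{(l)}_e := \sum_{\mathcal{C}^{(l)}} P(\mathcal{C}^{(l)}) P^{(l)}_e(\mathcal{C}^{(l)}) , \tag{A.3} \]

where $P(\mathcal{C}^{(l)})$ is the probability of obtaining the codebook $\mathcal{C}^{(l)}$ as a result of random 
code generation. In Lemma A.1, we show that $\bar{P}^{(l)}_e \to 0$ in the limit of $l \to \infty$. In 
Lemma A.2, we prove that for sufficiently large $l$, the tolerance $\tau^{(l)}$ of the codebook $\mathcal{C}^{(l)}$ 
is almost equal to 0 with arbitrarily high probability. Finally, we give the proof for 
(A.1) in Theorem A.3. 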
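
The random code construction above, and the letter-frequency claim that Lemma A.2 makes about it, can be illustrated with a small simulation. This is only a sketch under loud assumptions: the alphabets, the conditional distribution `p_cond`, the toy codebook `codebook_prime` and all function names are hypothetical, and the toy codebook is chosen only so that its letter frequency equals $p(x')$ exactly; it is not a good code for any channel, and decoding is not modelled.

```python
import random
from collections import Counter

def generate_codebook(codebook_prime, p_cond, rng):
    """Draw x_k(w) ~ p_E(. | x'_k(w)) independently for every message w and position k.

    codebook_prime: list of codewords, each a list of letters x'.
    p_cond: dict mapping x' to a dict {x: p_E(x | x')}.
    """
    codebook = []
    for codeword_prime in codebook_prime:
        codeword = []
        for x_prime in codeword_prime:
            letters = list(p_cond[x_prime])
            weights = [p_cond[x_prime][x] for x in letters]
            codeword.append(rng.choices(letters, weights=weights)[0])
        codebook.append(codeword)
    return codebook

def letter_frequency(codebook):
    """Relative frequency of each letter over all codewords and positions."""
    counts = Counter(letter for codeword in codebook for letter in codeword)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

# Hypothetical example: X' = {'a', 'b'}, X = {0, 1}.
p_cond = {'a': {0: 0.9, 1: 0.1}, 'b': {0: 0.2, 1: 0.8}}
p_prime = {'a': 0.5, 'b': 0.5}
# Toy codebook for channel III whose letter frequency is exactly p(x').
codebook_prime = [['a', 'b'] * 500 for _ in range(16)]

rng = random.Random(1)
codebook = generate_codebook(codebook_prime, p_cond, rng)

# Target distribution p(x) = sum_{x'} p_E(x | x') p(x').
p_x = {x: sum(p_prime[xp] * p_cond[xp][x] for xp in p_prime) for x in (0, 1)}
freq = letter_frequency(codebook)
tolerance = max(abs(freq.get(x, 0.0) - p_x[x]) for x in p_x)
print(p_x, tolerance)
```

With 16 codewords of length 1000 the observed tolerance is already small, in line with the convergence in probability that the lemma asserts for $l \to \infty$.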



Lemma A.1 

\[ \lim_{l \to \infty} \bar{P}^{(l)}_e = 0 . \tag{A.4} \]



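Proof. $\bar{P}^{(l)}_e$ defined by (A.3) is calculated to 

\[ \begin{aligned}
\bar{P}^{(l)}_e &= \sum_{\mathcal{C}^{(l)}} P(\mathcal{C}^{(l)}) \times \frac{1}{2^{lR}} \sum_{u=1}^{2^{lR}} P(\hat{W} \neq u | W = u, \mathcal{C}^{(l)}) \\
 &= \frac{1}{2^{lR}} \sum_{u=1}^{2^{lR}} \sum_{\mathcal{C}^{(l)}} P(\mathcal{C}^{(l)}) P(\hat{W} \neq u | W = u, \mathcal{C}^{(l)}) \\
 &= \frac{1}{2^{lR}} \sum_{u=1}^{2^{lR}} P(\hat{W} \neq u | W = u) ,
\end{aligned} \tag{A.5} \]

where 

\[ P(\hat{W} \neq u | W = u) := \sum_{\mathcal{C}^{(l)}} P(\mathcal{C}^{(l)}) P(\hat{W} \neq u | W = u, \mathcal{C}^{(l)}) . \tag{A.6} \]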

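The codebook $\mathcal{C}^{(l)}$ is determined by the codeletters $x_k(w)$ ($1 \le w \le 2^{lR}$, $1 \le k \le l$). Due 
to the way of randomly generating the code, the probability of obtaining the codebook 
such that $x_k(w) = \xi_{wk}$ ($1 \le w \le 2^{lR}$, $1 \le k \le l$) is given by 

\[ \begin{aligned}
P(\mathcal{C}^{(l)}) &= P(\{x_k(w)\}_{w,k} = \{\xi_{wk}\}_{w,k}) \\
 &= \prod_{w=1}^{2^{lR}} \prod_{k=1}^{l} P(x_k(w) = \xi_{wk}) \\
 &= \prod_{w=1}^{2^{lR}} \prod_{k=1}^{l} p_{\mathcal{E}}(x = \xi_{wk} | x' = x'_k(w)) .
\end{aligned} \tag{A.7} \]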

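Let $D(\phi_{x_1} \cdots \phi_{x_l})$ be the result of the decoding measurement $\mathcal{D}^{(l)}$ on the composite 
system $S_1 \cdots S_l$ in the state $\phi_{x_1} \cdots \phi_{x_l}$. We have 

\[ \begin{aligned}
P(\hat{W} \neq u | W = u, \mathcal{C}^{(l)}) &= P(D(\phi_{x_1(u)} \cdots \phi_{x_l(u)}) \neq u \,|\, \{x_k(w)\}_{w,k} = \{\xi_{wk}\}_{w,k}) \\
 &= P(D(\phi_{\xi_{u1}} \cdots \phi_{\xi_{ul}}) \neq u) ,
\end{aligned} \tag{A.8} \]

and we obtain 

\[ \begin{aligned}
P(\hat{W} \neq u | W = u)
 &= \sum_{\{\xi_{wk}\}_{w,k}} P(D(\phi_{x_1(u)} \cdots \phi_{x_l(u)}) \neq u \,|\, \{x_k(w)\}_{w,k} = \{\xi_{wk}\}_{w,k}) \times P(\{x_k(w)\}_{w,k} = \{\xi_{wk}\}_{w,k}) \\
 &= \sum_{\{\xi_{uk}\}_k} P(D(\phi_{\xi_{u1}} \cdots \phi_{\xi_{ul}}) \neq u) \times P(\{x_k(u)\}_k = \{\xi_{uk}\}_k) \\
 &= \sum_{\{\xi_{uk}\}_k} P(D(\phi_{\xi_{u1}} \cdots \phi_{\xi_{ul}}) \neq u) \times \prod_{k=1}^{l} p_{\mathcal{E}}(x = \xi_{uk} | x' = x'_k(u)) .
\end{aligned} \tag{A.9} \]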
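On the other hand, the error probability for the message $u$ when the channel III is used 
with the code $(\mathcal{C}'^{(l)}, \mathcal{D}'^{(l)})$ is given by 

\[ \begin{aligned}
P'(\hat{W} \neq u | W = u) &= P(D'(\phi_{x'_1(u)} \cdots \phi_{x'_l(u)}) \neq u) \\
 &= \sum_{x_1, \ldots, x_l} \prod_{k=1}^{l} p_{\mathcal{E}}(x = x_k | x' = x'_k(u)) \times P(D(\phi_{x_1} \cdots \phi_{x_l}) \neq u) .
\end{aligned} \tag{A.10} \]

From (A.9) and (A.10), we obtain 

\[ P(\hat{W} \neq u | W = u) = P'(\hat{W} \neq u | W = u) , \tag{A.11} \]

and consequently 

\[ \bar{P}^{(l)}_e = P'^{(l)}_e . \tag{A.12} \]

Therefore $\bar{P}^{(l)}_e \to 0$ when $l \to \infty$. □ 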
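
The equality (A.11) rests on the fact that the error probability is linear in each letter state, so averaging over randomly generated codewords gives the same result as feeding the mixed states into the decoder. The sketch below checks this in a hypothetical classical toy model in which every state $\phi_x$ is a probability distribution over one output bit. The numerical values, the decoding rule and the names `phi`, `p_cond`, `phi_prime` and `error_probability` are assumptions made for illustration only.

```python
import itertools

# Hypothetical classical toy model: each state phi_x is a distribution over an output bit.
phi = {0: (0.9, 0.1), 1: (0.3, 0.7)}                     # phi[x][y] = P(y | x)
p_cond = {'a': {0: 0.8, 1: 0.2}, 'b': {0: 0.1, 1: 0.9}}  # p_E(x | x')

def phi_prime(x_prime):
    """phi_{x'}: the mixture of phi_x with weights p_E(x | x')."""
    return tuple(sum(p_cond[x_prime][x] * phi[x][y] for x in phi) for y in (0, 1))

def error_probability(states, decoder, message):
    """P(decoder(y_1 .. y_l) != message) when y_k is drawn independently from states[k]."""
    total = 0.0
    for ys in itertools.product((0, 1), repeat=len(states)):
        prob = 1.0
        for state, y in zip(states, ys):
            prob *= state[y]
        if decoder(ys) != message:
            total += prob
    return total

def decoder(ys):
    """Arbitrary fixed decoding rule, shared by both channels."""
    return int(sum(ys) >= 2)

codeword_prime = ('a', 'b', 'a')  # x'_k(u) for one fixed message u
u = 0

# Right-hand side of (A.9): average over the randomly generated codeword for u.
averaged = 0.0
for xi in itertools.product((0, 1), repeat=len(codeword_prime)):
    weight = 1.0
    for x, x_prime in zip(xi, codeword_prime):
        weight *= p_cond[x_prime][x]
    averaged += weight * error_probability([phi[x] for x in xi], decoder, u)

# First line of (A.10): channel III fed with the original codeword.
direct = error_probability([phi_prime(xp) for xp in codeword_prime], decoder, u)
print(abs(averaged - direct) < 1e-12)
```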
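Proof. Let $f(x)^{(l)}$ and $f(x')^{(l)}$ be the letter frequencies of the codebooks $\mathcal{C}^{(l)}$ and $\mathcal{C}'^{(l)}$, 
respectively. We have 

\[ \begin{aligned}
\left| f(x)^{(l)} - p(x) \right|
 &= \left| f(x)^{(l)} - \sum_{x' \in \tilde{\mathcal{X}}'} p_{\mathcal{E}}(x|x') p(x') \right| \\
 &\le \left| f(x)^{(l)} - \sum_{x' \in \tilde{\mathcal{X}}'} f(x')^{(l)} p_{\mathcal{E}}(x|x') \right|
   + \left| \sum_{x' \in \tilde{\mathcal{X}}'} f(x')^{(l)} p_{\mathcal{E}}(x|x') - \sum_{x' \in \tilde{\mathcal{X}}'} p_{\mathcal{E}}(x|x') p(x') \right| \\
 &\le \left| f(x)^{(l)} - \sum_{x' \in \tilde{\mathcal{X}}'} f(x')^{(l)} p_{\mathcal{E}}(x|x') \right|
   + \sum_{x' \in \tilde{\mathcal{X}}'} p_{\mathcal{E}}(x|x') \left| f(x')^{(l)} - p(x') \right| .
\end{aligned} \]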



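Define 

\[ f(x, x')^{(l)} := \frac{ \left| \{ (k, w) \,|\, x_k(w) = x,\ x'_k(w) = x',\ 1 \le k \le l,\ 1 \le w \le 2^{lR} \} \right| }{ l \, 2^{lR} } \]

for $x \in \mathcal{X}$, $x' \in \tilde{\mathcal{X}}'$. By using the relation 

\[ f(x)^{(l)} = \sum_{x' \in \tilde{\mathcal{X}}'} f(x, x')^{(l)} , \tag{A.13} \]

we obtain 

\[ \Delta(x)^{(l)} := \left| f(x)^{(l)} - \sum_{x' \in \tilde{\mathcal{X}}'} f(x')^{(l)} p_{\mathcal{E}}(x|x') \right|
 \le \sum_{x' \in \tilde{\mathcal{X}}'} f(x')^{(l)} \left| \frac{f(x, x')^{(l)}}{f(x')^{(l)}} - p_{\mathcal{E}}(x|x') \right| . \tag{A.14} \]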



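Applying the weak law of large numbers for each term in the sum, we have $\Delta(x)^{(l)} \to 0$ 
($l \to \infty$) in probability. We also have 

\[ \sum_{x' \in \tilde{\mathcal{X}}'} p_{\mathcal{E}}(x|x') \left| f(x')^{(l)} - p(x') \right| \le \tau'^{(l)} \cdot |\tilde{\mathcal{X}}'| \tag{A.15} \]

and thus 

\[ \lim_{l \to \infty} \sum_{x' \in \tilde{\mathcal{X}}'} p_{\mathcal{E}}(x|x') \left| f(x')^{(l)} - p(x') \right| = 0 . \tag{A.16} \]

Therefore we obtain 

\[ \tau^{(l)} = \max_{x} \left| f(x)^{(l)} - p(x) \right| \to 0 \quad \text{in probability} . \tag{A.17} \]
□ 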

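Theorem A.3 $R$ is achievable with $p(x)$ by the channel I. 

Proof. Take arbitrary $\epsilon, \delta, \eta > 0$. From Lemma A.1 and Lemma A.2, for sufficiently 
large $l$ we have 

\[ \bar{P}^{(l)}_e < \epsilon \tag{A.18} \]

and 

\[ \Pr\{\tau^{(l)} < \delta\} > 1 - \eta . \tag{A.19} \]

Define $\mathcal{C}^{(l)}_\delta := \{\mathcal{C}^{(l)} | \tau^{(l)} < \delta\}$. The average error probability averaged over all codebooks 
in $\mathcal{C}^{(l)}_\delta$ is calculated to 

\[ \frac{ \sum_{\mathcal{C}^{(l)} \in \mathcal{C}^{(l)}_\delta} P(\mathcal{C}^{(l)}) P^{(l)}_e(\mathcal{C}^{(l)}) }{ \sum_{\mathcal{C}^{(l)} \in \mathcal{C}^{(l)}_\delta} P(\mathcal{C}^{(l)}) }
 \le \frac{ \sum_{\mathcal{C}^{(l)}} P(\mathcal{C}^{(l)}) P^{(l)}_e(\mathcal{C}^{(l)}) }{ \sum_{\mathcal{C}^{(l)} \in \mathcal{C}^{(l)}_\delta} P(\mathcal{C}^{(l)}) }
 < \frac{\epsilon}{1 - \eta} . \]

Thus there exists at least one codebook $\mathcal{C}^{(l)} \in \mathcal{C}^{(l)}_\delta$ such that $P^{(l)}_e(\mathcal{C}^{(l)}) < \epsilon' = \epsilon/(1 - \eta)$ and, 
by definition, $\tau^{(l)} < \delta$. Hence there exists a sequence of $(2^{lR}, l)$ codes for the channel I 
such that $P^{(l)}_e \to 0$ and $\tau^{(l)} \to 0$ when $l \to \infty$, and thus $R$ is achievable with $p(x)$ by 
the channel I. □ 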



Appendix B. State space of a qubit 

Suppose that information of two independent and uniformly random bits $x_0 x_1$ is encoded 
into the state of a qubit $\rho_{x_0 x_1}$. Let $\{M^m_t\}_{t=0,1}$ be the optimal measurement for decoding 
$x_m$ ($m = 0, 1$), where the mutual information $I_c(X_m : T)$ between $X_m$ and the 
measurement outcome $T$ is maximized when the measurement $m$ is performed. We 
assume that for all $x_0$ and $x_1$, 

\[ P(t = x_0 | m = 0, x_0, x_1) = \mathrm{tr}[M^0_{x_0} \rho_{x_0 x_1}] = \frac{1 + \alpha}{2} \quad (0 \le \alpha \le 1) , \tag{B.1} \]

\[ P(t = x_1 | m = 1, x_0, x_1) = \mathrm{tr}[M^1_{x_1} \rho_{x_0 x_1}] = \frac{1 + \beta}{2} \quad (0 \le \beta \le 1) . \tag{B.2} \]

In what follows, we prove that such a set of density operators $\{\rho_{x_0 x_1}\}_{x_0 x_1}$ and 
POVM operators for the measurements exists if and only if $\alpha^2 + \beta^2 \le 1$. Considering 
the parametrization of a qubit state using the Bloch sphere, the sufficiency is obviously 
verified. The necessity is proved as follows. Let $r_{x_0 x_1}$ be the Bloch vector representation 
of $\rho_{x_0 x_1}$ and $u$, $v$ be those of $M^0_0$ and $M^1_0$, respectively. Formally, we have 

\[ \rho_{x_0 x_1} = \frac{1}{2} (I + r_{x_0 x_1} \cdot \sigma) \quad (\| r_{x_0 x_1} \| \le 1) , \tag{B.3} \]

\[ M^0_t = \frac{1}{2} (I + (-1)^t u \cdot \sigma) , \tag{B.4} \]

and 

\[ M^1_t = \frac{1}{2} (I + (-1)^t v \cdot \sigma) , \tag{B.5} \]

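where $\sigma = (\sigma_x, \sigma_y, \sigma_z)$. The optimality of the measurement implies that $\|u\| = \|v\| = 1$. 
From the conditions (B.1) and (B.2), we obtain 

\[ \begin{aligned}
u \cdot r_{00} &= u \cdot r_{01} = -u \cdot r_{10} = -u \cdot r_{11} = \alpha , \\
v \cdot r_{00} &= -v \cdot r_{01} = v \cdot r_{10} = -v \cdot r_{11} = \beta .
\end{aligned} \tag{B.6} \]

Let $\tilde{r}_{x_0 x_1}$ be the projection vectors of $r_{x_0 x_1}$ onto the two dimensional subspace spanned 
by $u$ and $v$. Then we have 

\[ \tilde{r}_{00} + \tilde{r}_{11} = \tilde{r}_{01} + \tilde{r}_{10} = 0 \tag{B.7} \]

and 

\[ u \cdot (\tilde{r}_{00} - \tilde{r}_{01}) = v \cdot (\tilde{r}_{00} - \tilde{r}_{10}) = 0 . \tag{B.8} \]

Due to the optimality of the decoding measurement, we also have $u \parallel (\tilde{r}_{00} + \tilde{r}_{01})$ and 
$v \parallel (\tilde{r}_{00} + \tilde{r}_{10})$. Thus we obtain $u \cdot v = 0$. Hence 

\[ \alpha^2 + \beta^2 = (u \cdot \tilde{r}_{x_0 x_1})^2 + (v \cdot \tilde{r}_{x_0 x_1})^2 \le \| \tilde{r}_{x_0 x_1} \|^2 \le 1 . \tag{B.9} \]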
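
The sufficiency direction, which the text treats as obvious from the Bloch sphere picture, can be made concrete with one explicit encoding. The sketch below takes $u$ along the x axis and $v$ along the z axis and sets $r_{x_0 x_1} = ((-1)^{x_0} \alpha, 0, (-1)^{x_1} \beta)$; this particular choice and the function names are assumptions for illustration, not the construction used by the authors. It relies on the standard identity $\mathrm{tr}[\frac{1}{2}(I + a \cdot \sigma) \frac{1}{2}(I + b \cdot \sigma)] = \frac{1}{2}(1 + a \cdot b)$.

```python
import math

def dot(a, b):
    """Euclidean inner product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def outcome_probability(t, direction, bloch):
    """tr[M_t rho] for M_t = (I + (-1)^t direction.sigma)/2 and rho = (I + bloch.sigma)/2."""
    return (1 + (-1) ** t * dot(direction, bloch)) / 2

def encoding(alpha, beta):
    """Bloch vectors r_{x0 x1} with u along x and v along z (one possible choice)."""
    return {(x0, x1): ((-1) ** x0 * alpha, 0.0, (-1) ** x1 * beta)
            for x0 in (0, 1) for x1 in (0, 1)}

u, v = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
alpha = beta = 1 / math.sqrt(2)  # a point on the boundary alpha^2 + beta^2 = 1
states = encoding(alpha, beta)
for (x0, x1), r in states.items():
    p0 = outcome_probability(x0, u, r)  # P(t = x0 | m = 0, x0, x1), cf. (B.1)
    p1 = outcome_probability(x1, v, r)  # P(t = x1 | m = 1, x0, x1), cf. (B.2)
    is_valid_state = dot(r, r) <= 1 + 1e-12
    print((x0, x1), round(p0, 6), round(p1, 6), is_valid_state)
```

For every pair $(x_0, x_1)$ the two success probabilities equal $(1 + \alpha)/2$ and $(1 + \beta)/2$, and the Bloch vectors have squared norm $\alpha^2 + \beta^2$, so they describe valid qubit states exactly when $\alpha^2 + \beta^2 \le 1$.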

Appendix C. Inclusion relation of the sets of no-signalling correlations 

The inclusion relations among the sets of bipartite and multipartite no-signalling 
correlations are given in (C.1). 

\[ \mathcal{NS} \overset{(a)}{=} \mathcal{NSS} \overset{(b)}{\supset} \mathcal{IC} \overset{(c)}{\supseteq} \mathcal{CR} \overset{(d)}{\supseteq} \mathcal{Q} \overset{(e)}{\supset} \mathcal{C} \tag{C.1} \]

$\mathcal{NS}$ is the set of all no-signalling correlations. $\mathcal{NSS}$ is the set of all no-signalling 
correlations that satisfy the no-supersignalling condition. By "satisfy" we mean that for 
any communication protocol using that correlation, the condition is never violated. 
Similarly, $\mathcal{IC}$ and $\mathcal{CR}$ are the sets of all no-signalling correlations that satisfy information 
causality and the chain rule, respectively. $\mathcal{Q}$ and $\mathcal{C}$ are the sets of quantum and classical 
correlations, respectively. $\supset$ represents the genuine inclusion relation, and $\supseteq$ indicates 
that we do not know whether the sets are equivalent or have a genuine inclusion relation. 
(a) is proved in Section 5. (b) is proved in [4]. (c) follows from the discussion in Section 
7. (d) is obvious and (e) is proved in [1]. It has recently been proved, from the observation 
of tripartite nonlocal correlations, that at least one of (c) and (d) is a genuine inclusion 
relation [22, 23]. 